Abstract: The Support Vector Machine (SVM) is a popular classification
paradigm in machine learning and has achieved great success in real
applications. However, the standard SVM cannot select variables automatically,
so its solution typically uses all the input variables without
discrimination. This makes it difficult to identify important predictor
variables, which is often one of the primary goals in data analysis. In this
paper, we propose two novel types of regularization in the context of the
multicategory SVM (MSVM) for simultaneous classification and variable
selection. The MSVM generally requires estimation of multiple discriminating
functions and applies the argmax rule for prediction. For each individual
variable, we propose to characterize its importance by the supnorm of its
coefficient vector associated with different functions, and then minimize the
MSVM hinge loss function subject to a penalty on the sum of supnorms. To
further improve the supnorm penalty, we propose adaptive regularization,
which allows different weights to be imposed on different variables according
to their relative importance. Both types of regularization automate variable
selection in the process of building classifiers, and lead to sparse
multi-classifiers with enhanced interpretability and improved accuracy,
especially for high dimensional low sample size data. One big advantage of the
supnorm penalty is its easy implementation via standard linear programming.
Several simulated examples and one real gene data analysis demonstrate
the outstanding performance of the adaptive supnorm penalty in various
data settings.

1. Introduction

In supervised learning problems, we are given a training set of $n$ examples from
$K \ge 2$ different populations. For each example in the training set, we observe
its covariate $x_i \in \mathbb{R}^d$ and the corresponding label $y_i$ indicating its
membership. Our ultimate goal is to learn a classification rule which can
accurately predict the class label of a future example based on its covariate.
Among many classification methods, the Support Vector Machine (SVM) has gained
much popularity in both machine learning and statistics. The seminal work by
Vapnik (1995, 1998) has laid the foundation for the general statistical learning
theory and the SVM, which furthermore inspired various extensions of the SVM.
For other references on the binary SVM, see Cristianini and Shawe-Taylor
(2000), Schölkopf and Smola (2002), and references therein. Recently a few
attempts have been made to generalize the SVM to multiclass problems, such
as Vapnik (1998), Weston and Watkins (1999), Crammer and Singer (2001),
Lee et al. (2004), Liu and Shen (2006), and Wu and Liu (2007a).

While the SVM outperforms many other methods in terms of classification 
accuracy in numerous real problems, the implicit nature of its solution makes 
it less attractive in providing insight into the predictive ability of individual 
variables. Often, selecting relevant variables is the primary goal of data
mining. For the binary SVM, Bradley and Mangasarian (1998) demonstrated
the utility of the $L_1$ penalty, which can effectively select variables by shrinking
small or redundant coefficients to zero. Zhu et al. (2003) provide an efficient
algorithm to compute the entire solution path for the $L_1$-norm SVM. Other forms
of penalty have also been studied in the context of binary SVMs, such as the $L_0$
penalty (Weston et al., 2003), the SCAD penalty (Zhang et al., 2006), the $L_q$
penalty (Liu et al., 2007), the combination of $L_0$ and $L_1$ penalty (Liu and Wu,
2007), the combination of $L_1$ and $L_2$ penalty (Wang et al., 2006), the $F_\infty$ norm
(Zou and Yuan, 2006), and others (Zhao et al., 2006; Zou, 2006).

For multiclass problems, variable selection becomes more complex than the 
binary case, since the MSVM requires estimation of multiple discriminating 
functions, among which each function has its own subset of important predictors. 
One natural idea is to extend the $L_1$ SVM to the $L_1$ MSVM, as done in the recent
work of Lee et al. (2006) and Wang and Shen (2007b). However, the $L_1$ penalty
does not distinguish the source of coefficients. It treats all the coefficients equally,
no matter whether they correspond to the same variable or different variables,
or whether they are more likely to be relevant or irrelevant. In this paper, we
propose a new regularized MSVM for more effective variable selection. In contrast
to the $L_1$ MSVM, which imposes a penalty on the sum of absolute values of all
coefficients, we penalize the sup-norm of the coefficients associated with each
variable. The proposed method is shown to be able to achieve a higher degree
of model parsimony than the $L_1$ MSVM without compromising classification
accuracy.

This paper is organized as follows. Section 2 formulates the sup-norm regu- 
larization for the MSVM. Section 3 proposes an efficient algorithm to implement 
the MSVM. Section 4 discusses an adaptive approach to improve performance 
of the sup-norm MSVM by allowing different penalties for different covariates 
according to their relative importance. Numerical results on simulated and gene 
expression data are given in Sections 5 and 6, followed by a summary. 

2. Methodology 

In $K$-category classification problems, we code $y$ as $\{1, \ldots, K\}$ and define
$f = (f_1, \ldots, f_K)$ as a decision function vector. Each $f_k$, a mapping from the input
domain $\mathbb{R}^d$ to $\mathbb{R}$, represents the strength of the evidence that an example with
input $x$ belongs to the class $k$; $k = 1, \ldots, K$. A classifier induced by $f$,

$$\phi(x) = \arg\max_{k = 1, \ldots, K} f_k(x),$$

assigns an example with $x$ to the class with the largest $f_k(x)$. We assume the
$n$ training pairs $\{(x_i, y_i), i = 1, \ldots, n\}$ are independently and identically
distributed according to an unknown probability distribution $P(x, y)$. Given a
classifier $f$, its performance is measured by the generalization error,
$GE(f) = P(Y \ne \arg\max_k f_k(X)) = E_{(X,Y)}[I(Y \ne \arg\max_k f_k(X))]$.
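
As a small illustration of the argmax rule and of the empirical counterpart of the generalization error, the following sketch may help. It assumes the decision values $f_k(x_i)$ are already available as an $n \times K$ array; the function names `predict` and `empirical_error` are hypothetical and not part of the paper.

```python
import numpy as np

def predict(F):
    """Argmax rule. F is an (n, K) array with F[i, k-1] = f_k(x_i).

    Returns predicted labels coded as 1..K, matching the coding of y.
    """
    return np.argmax(F, axis=1) + 1

def empirical_error(F, y):
    """Fraction of examples whose argmax label differs from y."""
    return float(np.mean(predict(F) != y))

# Toy decision values for n = 3 examples and K = 3 classes (made-up numbers).
F = np.array([[0.5, -0.2, -0.3],
              [-1.0, 0.1, 0.9],
              [0.0, 0.4, -0.4]])
y = np.array([1, 3, 1])

labels = predict(F)
err = empirical_error(F, y)
```

Here the third example is assigned to class 2 although its label is 1, so one of the three examples is misclassified.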

Let $p_k(x) = \Pr(Y = k \mid X = x)$ be the conditional probability of class $k$ given
$X = x$. The Bayes rule which minimizes the GE is then given by

$$\phi_B(x) = \arg\min_{k = 1, \ldots, K} [1 - p_k(x)] = \arg\max_{k = 1, \ldots, K} p_k(x). \tag{2.1}$$

For nonlinear problems, we assume $f_k(x) = b_k + \sum_j w_{kj} h_j(x)$ using a set
of basis functions $\{h_j(x)\}$. This linear representation of a nonlinear classifier
through basis functions will greatly facilitate the formulation of the proposed 
method. Alternatively nonlinear classifiers can also be achieved by applying the 
kernel trick (Boser et al., 1992). However, the kernel classifier is often given as 
a black box function, where the contribution of each individual covariate to the 
decision rule is too implicit to be characterized. Therefore we will use the basis 
expansion to construct nonlinear classifiers in the paper. 

The standard multicategory SVM (MSVM; Lee et al., 2004) solves

$$\min_{f} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \ne k)\,[f_k(x_i) + 1]_+ + \lambda \sum_{k=1}^{K} \sum_{j=1}^{d} w_{kj}^2 \tag{2.2}$$

under the sum-to-zero constraint $\sum_{k=1}^{K} f_k = 0$. The sum-to-zero constraint is used
here to follow the MSVM framework of Lee et al. (2004). It is imposed
to eliminate redundancy in the $f_k$'s and to assure identifiability of the solution. This
constraint is also a necessary condition for the Fisher consistency of the MSVM
proposed by Lee et al. (2004). To achieve variable selection, Wang and Shen
(2007b) proposed to impose the $L_1$ penalty on the coefficients, and the
corresponding $L_1$ MSVM then solves

$$\min_{b, w} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \ne k)\,[b_k + w_k^T x_i + 1]_+ + \lambda \sum_{k=1}^{K} \sum_{j=1}^{d} |w_{kj}| \tag{2.3}$$

under the sum-to-zero constraint. For linear classification rules, we start with
$f_k(x) = b_k + \sum_{j=1}^{d} w_{kj} x_j$, $k = 1, \ldots, K$. The sum-to-zero constraint then
becomes

$$\sum_{k=1}^{K} b_k = 0, \qquad \sum_{k=1}^{K} w_{kj} = 0, \quad j = 1, \ldots, d. \tag{2.4}$$

The $L_1$ MSVM treats all $w_{kj}$'s equally without distinction. As opposed to
this, we take into account the fact that some of the coefficients are associated 
with the same covariate, therefore it is more natural to treat them as a group 
rather than separately. 

Define the weight matrix $W$ of size $K \times d$ such that its $(k, j)$ entry is $w_{kj}$.
The structure of $W$ is shown as follows:

$$W = \begin{pmatrix} w_{11} & \cdots & w_{1j} & \cdots & w_{1d} \\ \vdots & & \vdots & & \vdots \\ w_{K1} & \cdots & w_{Kj} & \cdots & w_{Kd} \end{pmatrix}.$$

Throughout the paper, we use $w_k = (w_{k1}, \ldots, w_{kd})^T$ to represent the $k$th
row vector of $W$, and $w_{(j)} = (w_{1j}, \ldots, w_{Kj})^T$ for the $j$th column vector of
$W$. According to Crammer and Singer (2001), the value $b_k + w_k^T x$ defines the
similarity score of the class $k$, and the predicted label is the index of the row
attaining the highest similarity score with $x$. We define the sup-norm for the
coefficient vector $w_{(j)}$ as

$$\|w_{(j)}\|_\infty = \max_{k = 1, \ldots, K} |w_{kj}|. \tag{2.5}$$

In this way, the importance of each covariate $x_j$ is directly controlled by its
largest absolute coefficient. We propose the sup-norm regularization for MSVM:

$$\min_{b, w} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \ne k)\,[b_k + w_k^T x_i + 1]_+ + \lambda \sum_{j=1}^{d} \|w_{(j)}\|_\infty,$$

$$\text{subject to } \mathbf{1}^T b = 0, \quad \mathbf{1}^T w_{(j)} = 0, \quad \text{for } j = 1, \ldots, d, \tag{2.6}$$

where $b = (b_1, \ldots, b_K)^T$.
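
To make the pieces of (2.6) concrete, here is a minimal sketch that evaluates the objective for a given pair $(b, W)$. It is only an illustration under the assumption that labels are coded $1, \ldots, K$ and that rows of `X` are examples; the function name `supnorm_objective` is hypothetical.

```python
import numpy as np

def supnorm_objective(b, W, X, y, lam):
    """Evaluate the objective in (2.6) for given b (K,), W (K, d), X (n, d).

    y holds labels coded 1..K and lam is the tuning parameter lambda.
    """
    n, K = X.shape[0], W.shape[0]
    F = X @ W.T + b                                   # F[i, k] = b_k + w_k^T x_i
    A = y[:, None] != np.arange(1, K + 1)[None, :]    # indicator I(y_i != k)
    hinge = np.maximum(F + 1.0, 0.0)                  # [b_k + w_k^T x_i + 1]_+
    loss = (A * hinge).sum() / n
    penalty = np.abs(W).max(axis=0).sum()             # sum over j of ||w_(j)||_inf
    return loss + lam * penalty

# Toy input with K = 3 classes and d = 2 covariates (made-up numbers).
W = np.array([[1.0, 0.0],
              [-1.0, 0.0],
              [0.0, 0.0]])
b = np.zeros(3)
X = np.array([[2.0, 5.0],
              [-2.0, 1.0]])
y = np.array([1, 2])

value = supnorm_objective(b, W, X, y, lam=0.5)
```

In this toy case the second column of `W` is entirely zero, so the second covariate contributes nothing to the penalty; the hinge loss equals 1 and the penalty equals 1, giving a value of 1.5 at $\lambda = 0.5$.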

The sup-norm MSVM encourages more sparse solutions than the $L_1$ MSVM,
and identifies important variables more precisely. In the following, we describe
the main motivation of the sup-norm MSVM, which makes it more attractive for
variable selection than the $L_1$ MSVM. Firstly, with a sup-norm penalty, a noise
variable is removed if and only if all corresponding $K$ estimated coefficients are
0. On the other hand, if a variable is important with a positive sup-norm, the
sup-norm penalty, unlike the $L_1$ penalty, does not put any additional penalties
on the other $K - 1$ coefficients. This is desirable since a variable will be kept in
the model as long as the sup-norm of the $K$ coefficients is positive. No further
shrinkage is needed for the remaining coefficients in terms of variable selection.
For illustration, we plot the region $0 \le t_1 + t_2 \le C$ in Figure 1, where
$t_1 = \max\{|w_{11}|, |w_{21}|, |w_{31}|, |w_{41}|\}$ and $t_2 = \max\{|w_{12}|, |w_{22}|, |w_{32}|, |w_{42}|\}$.
Clearly, the sup-norm penalty shrinks the sum of the two maxima corresponding to
the two variables. This helps to lead to more parsimonious models. In short, in
contrast to the $L_1$ penalty, the sup-norm utilizes the group information of the
decision function vector and consequently the sup-norm MSVM can deliver better
variable selection.

For three-class problems, we show that the $L_1$ MSVM and the newly proposed
sup-norm MSVM give identical solutions after adjusting the tuning parameters,
which is due to the sum-to-zero constraints on the $w_{(j)}$'s. This equivalence, however,
does not hold for the adaptive procedures introduced in Section 4.

Fig 1. Illustrative plot of the shrinkage property of the sup-norm.



Proposition 2.1. When $K = 3$, the $L_1$ MSVM (2.3) and the sup-norm MSVM
(2.6) are equivalent.
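
One way to see why such an equivalence is plausible (a sketch, not the authors' proof): if three numbers sum to zero, the one with the largest absolute value equals minus the sum of the other two, and those two must share a sign or vanish, so $\sum_k |w_{kj}| = 2 \max_k |w_{kj}|$. The $L_1$ penalty is then exactly twice the sup-norm penalty, which can be absorbed into $\lambda$. The check below verifies this identity numerically for random sum-to-zero triples and shows two hand-picked $K = 4$ vectors for which the ratio is not constant.

```python
import numpy as np

rng = np.random.default_rng(0)

def l1_and_sup(w):
    """Return the L1 norm and the sup-norm of each row of w."""
    return np.abs(w).sum(axis=-1), np.abs(w).max(axis=-1)

# K = 3: random coefficient triples, centred so that each row sums to zero.
w3 = rng.normal(size=(1000, 3))
w3 -= w3.mean(axis=1, keepdims=True)
l1_3, sup_3 = l1_and_sup(w3)
k3_identity = bool(np.allclose(l1_3, 2.0 * sup_3))

# K = 4: two sum-to-zero vectors with different L1 / sup-norm ratios.
w4 = np.array([[1.0, 1.0, -1.0, -1.0],
               [3.0, -1.0, -1.0, -1.0]])
l1_4, sup_4 = l1_and_sup(w4)
ratios_k4 = l1_4 / sup_4
```

For $K = 4$ the two example vectors give ratios 4 and 2, so no single rescaling of $\lambda$ maps one penalty onto the other.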

When $K > 3$, our empirical experience shows that the sup-norm MSVM
generally performs well in terms of classification accuracy.

Here we would like to point out two fundamental differences between the
sup-norm penalty and the $F_\infty$ penalty used for group variable selection (Zhao et al.,
2006; Zou and Yuan, 2006), considering their similar expressions. First, the purpose
of group selection is to select several prediction variables altogether if these
predictors work as a group. Therefore, each $F_\infty$ term in Zou and Yuan (2006)
is based on the regression coefficients of several variables which belong to one
group, whereas each supnorm penalty in (2.6) is associated with only one
prediction variable. Second, in the implementation of the $F_\infty$ penalty, one has to
decide in advance the number of groups and which variables belong to a certain group,
whereas in the supnorm SVM each variable is naturally associated with its own
group and the number of groups is the same as the number of covariates.

As a remark, we point out that Argyriou et al. (2007, 2006) proposed a similar
penalty for the purpose of multi-task feature learning. Specifically, they used a
mixture of $L_1$ and $L_2$ penalties. They first applied the $L_2$ penalty for each
feature across different tasks and then used the $L_1$ penalty for feature selection.
In contrast, our penalty is a combination of the $L_1$ and supnorm penalties for
multicategory classification.

The tuning parameter $\lambda$ in (2.6) balances the tradeoff between the data fit
and the model parsimony. A proper choice of $\lambda$ is important to assure good
performance of the resulting classifier. If the chosen $\lambda$ is too small, the procedure
tends to overfit the training data and gives a less sparse solution; on the other
hand, if $\lambda$ is too large, the solution can become very sparse but possibly with
low prediction power. The choice of the tuning parameter is typically made
by minimizing either an estimate of the generalization error or other related
performance measures. In simulations, we generate an extra independent tuning set to
choose the best $\lambda$. For real data analysis, we use leave-one-out cross validation
of the misclassification rate to select $\lambda$.

3. Computational Algorithms 

In this section we show that the optimization problem (2.6) can be converted to 
a linear programming (LP) problem, and can therefore be solved using standard 
LP techniques in polynomial time. This great computational advantage is very 
important in real applications, especially for large data sets. 

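Let $A$ be an $n \times K$ matching matrix with its entry $a_{ik} = I(y_i \ne k)$ for
$i = 1, \ldots, n$ and $k = 1, \ldots, K$. First we introduce slack variables $\xi_{ik}$ such that

$$\xi_{ik} = [b_k + w_k^T x_i + 1]_+ \quad \text{for } i = 1, \ldots, n; \; k = 1, \ldots, K. \tag{3.1}$$

The optimization problem (2.6) can be expressed as

$$\min_{b, w, \xi} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} a_{ik} \xi_{ik} + \lambda \sum_{j=1}^{d} \|w_{(j)}\|_\infty,$$

$$\text{subject to } \mathbf{1}^T b = 0, \quad \mathbf{1}^T w_{(j)} = 0, \quad j = 1, \ldots, d,$$

$$\xi_{ik} \ge b_k + w_k^T x_i + 1, \quad \xi_{ik} \ge 0, \quad i = 1, \ldots, n; \; k = 1, \ldots, K. \tag{3.2}$$

To further simplify (3.2), we introduce a second set of slack variables

$$\eta_j = \|w_{(j)}\|_\infty = \max_{k = 1, \ldots, K} |w_{kj}|,$$

which add some new constraints to the problem:

$$|w_{kj}| \le \eta_j, \quad \text{for } k = 1, \ldots, K; \; j = 1, \ldots, d.$$

Finally write $w_{kj} = w_{kj}^+ - w_{kj}^-$, where $w_{kj}^+$ and $w_{kj}^-$ denote the positive and
negative parts of $w_{kj}$, respectively. Similarly, $w_{(j)}^+$ and $w_{(j)}^-$ respectively consist of
the positive and negative parts of the components in $w_{(j)}$. Denote $\eta = (\eta_1, \ldots, \eta_d)^T$;
then (3.2) becomes

$$\min_{b, w, \xi, \eta} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} a_{ik} \xi_{ik} + \lambda \sum_{j=1}^{d} \eta_j,$$

$$\text{subject to } \mathbf{1}^T b = 0, \quad \mathbf{1}^T [w_{(j)}^+ - w_{(j)}^-] = 0, \quad j = 1, \ldots, d,$$

$$\xi_{ik} \ge b_k + [w_k^+ - w_k^-]^T x_i + 1, \quad \xi_{ik} \ge 0, \quad i = 1, \ldots, n; \; k = 1, \ldots, K,$$

$$w_{(j)}^+ + w_{(j)}^- \le \eta_j \mathbf{1}, \quad w_{(j)}^+ \ge 0, \quad w_{(j)}^- \ge 0, \quad j = 1, \ldots, d. \tag{3.3}$$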
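
The linear program (3.3) can be sketched with a general-purpose LP solver. The code below is an illustrative dense implementation for small problems, assuming SciPy's `linprog` with the HiGHS backend in place of the CPLEX/AMPL setup used in the paper; the function name `fit_supnorm_msvm` and the variable layout are hypothetical choices, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def fit_supnorm_msvm(X, y, K, lam):
    """Solve LP (3.3). Variable order: b, w+, w-, xi, eta (all but b are >= 0)."""
    n, d = X.shape
    ob, op, om = 0, K, K + K * d                 # offsets of b, w+, w-
    ox, oe = K + 2 * K * d, K + 2 * K * d + n * K  # offsets of xi, eta
    N = oe + d

    A = (y[:, None] != np.arange(1, K + 1)[None, :]).astype(float)  # a_ik
    c = np.zeros(N)
    c[ox:oe] = A.ravel() / n                     # (1/n) * sum a_ik xi_ik
    c[oe:] = lam                                 # lambda * sum eta_j

    # Sum-to-zero constraints on b and on every column w_(j).
    A_eq = np.zeros((1 + d, N))
    A_eq[0, ob:ob + K] = 1.0
    for j in range(d):
        for k in range(K):
            A_eq[1 + j, op + k * d + j] = 1.0
            A_eq[1 + j, om + k * d + j] = -1.0
    b_eq = np.zeros(1 + d)

    A_ub = np.zeros((n * K + K * d, N))
    b_ub = np.zeros(n * K + K * d)
    r = 0
    for i in range(n):                           # b_k + w_k^T x_i + 1 <= xi_ik
        for k in range(K):
            A_ub[r, ob + k] = 1.0
            A_ub[r, op + k * d: op + (k + 1) * d] = X[i]
            A_ub[r, om + k * d: om + (k + 1) * d] = -X[i]
            A_ub[r, ox + i * K + k] = -1.0
            b_ub[r] = -1.0
            r += 1
    for k in range(K):                           # w+_kj + w-_kj <= eta_j
        for j in range(d):
            A_ub[r, op + k * d + j] = 1.0
            A_ub[r, om + k * d + j] = 1.0
            A_ub[r, oe + j] = -1.0
            r += 1

    bounds = [(None, None)] * K + [(0, None)] * (N - K)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    z = res.x
    W = (z[op:om] - z[om:ox]).reshape(K, d)      # w = w+ - w-
    return z[ob:ob + K], W, res

# Small synthetic run (random data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.integers(1, 4, size=30)                  # labels in {1, 2, 3}
b, W, res = fit_supnorm_msvm(X, y, K=3, lam=0.1)
```

The dense constraint matrices are chosen for readability; for data of realistic size a sparse construction would be preferable. Variables whose column of `W` is numerically zero are the ones removed by the sup-norm penalty.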

4. Adaptive Penalty

In (2.3) and (2.6), the same weights are used for different variables in the penalty 
terms, which may be too restrictive, since a smaller penalty may be more desired 
for those variables which are so important that we want to retain them in the 
model. In this section, we suggest that different variables should be penalized 
differently according to their relative importance. Ideally, large penalties should 
be imposed on redundant variables in order to eliminate them from models more 
easily; and small penalties should be used on important variables in order to 
retain them in the final classifier. Motivated by this, we consider the following 
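adaptive $L_1$ MSVM:

$$\min_{b, w} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \ne k)\,[b_k + w_k^T x_i + 1]_+ + \lambda \sum_{k=1}^{K} \sum_{j=1}^{d} \tau_{kj} |w_{kj}|,$$

$$\text{subject to } \mathbf{1}^T b = 0, \quad \mathbf{1}^T w_{(j)} = 0, \quad \text{for } j = 1, \ldots, d, \tag{4.1}$$

where $\tau_{kj} > 0$ represents the weight for coefficient $w_{kj}$.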

Adaptive shrinkage for each variable has been proposed and studied in var- 
ious contexts of regression problems, including the adaptive LASSO for linear 
regression (Zou, 2006), proportional hazard models (Zhang and Lu, 2007), and 
quantile regression (Wang et al., 2007; Wu and Liu, 2007b). In particular, Zou
(2006) has established the oracle property of the adaptive LASSO and justified 
the use of different amounts of shrinkage for different variables. Due to the spe- 
cial form of the sup-norm SVM, we consider the following two ways to employ 
the adaptive penalties: 

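[I]

$$\min_{b, w} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \ne k)\,[b_k + w_k^T x_i + 1]_+ + \lambda \sum_{j=1}^{d} \tau_j \|w_{(j)}\|_\infty,$$

$$\text{subject to } \mathbf{1}^T b = 0, \quad \mathbf{1}^T w_{(j)} = 0, \quad \text{for } j = 1, \ldots, d, \tag{4.2}$$

[II]

$$\min_{b, w} \; \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i \ne k)\,[b_k + w_k^T x_i + 1]_+ + \lambda \sum_{j=1}^{d} \|(\tau w)_{(j)}\|_\infty,$$

$$\text{subject to } \mathbf{1}^T b = 0, \quad \mathbf{1}^T w_{(j)} = 0, \quad \text{for } j = 1, \ldots, d, \tag{4.3}$$

where the vector $(\tau w)_{(j)} = (\tau_{1j} w_{1j}, \ldots, \tau_{Kj} w_{Kj})^T$ for $j = 1, \ldots, d$.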

In (4.1), (4.2), and (4.3), the weights can be regarded as leverage factors, 
which are adaptively chosen such that large penalties are imposed on coeffi- 
cients of unimportant covariates and small penalties on coefficients of important 
ones. Let $\tilde{w}$ be the solution to the standard MSVM (2.2) with the $L_2$ penalty. Our
empirical experience suggests that

$$\tau_{kj} = \frac{1}{|\tilde{w}_{kj}|}$$

is a good choice for (4.1) and (4.3), and

$$\tau_j = \frac{1}{\|\tilde{w}_{(j)}\|_\infty}$$

is a good choice for (4.2). If $\tilde{w}_{kj} = 0$, which implies an infinite penalty on $w_{kj}$,
we set the corresponding coefficient solution $\hat{w}_{kj}$ to be zero.
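
The two weight choices can be computed directly from an initial coefficient matrix. The sketch below assumes $\tilde{w}$ is available as a $K \times d$ array; the function name `adaptive_weights` is hypothetical, and infinite weights are used to mark coefficients (or variables) that would be fixed at zero.

```python
import numpy as np

def adaptive_weights(W_tilde):
    """Return (tau_kj, tau_j) computed from an initial K x d estimate.

    tau_kj = 1 / |w~_kj|          (suggested for (4.1) and (4.3))
    tau_j  = 1 / ||w~_(j)||_inf   (suggested for (4.2))
    Zero initial coefficients give infinite weights.
    """
    absW = np.abs(W_tilde)
    with np.errstate(divide="ignore"):
        tau_kj = 1.0 / absW
        tau_j = 1.0 / absW.max(axis=0)
    return tau_kj, tau_j

# Toy initial estimate with K = 3 and d = 2; the second variable is all zero.
W_tilde = np.array([[2.0, 0.0],
                    [-4.0, 0.0],
                    [2.0, 0.0]])
tau_kj, tau_j = adaptive_weights(W_tilde)
```

In the type I problem (4.2), a finite $\tau_j$ simply rescales the cost attached to $\eta_j$ in the LP (3.3), so the same solver can be reused with cost $\lambda \tau_j$; variables with an infinite weight would be dropped before solving.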

In terms of computational issues, all three problems (4.1), (4.2), and (4.3) 
can be solved as LP problems. Their entire solution paths may be obtained by 
some modifications of the algorithms in Wang and Shen (2007b). 

5. Simulation 

In this section, we demonstrate the performance of six MSVM methods: the
standard $L_2$ MSVM, $L_1$ MSVM, sup-norm MSVM, adaptive $L_1$ MSVM, and
the two adaptive sup-norm MSVMs. Three simulation models are considered:
(1) a linear example with five classes; (2) a linear example with four classes; (3) a
nonlinear example with three classes. In each simulation setting, $n$ observations
are simulated as the training data, and another $n$ observations are generated
for tuning the regularization parameter $\lambda$ for each procedure. Therefore the
total sample size is $2n$ for obtaining the final classifiers. To test the accuracy
of the classification rules, we also independently generate $n'$ observations as a
test set. The tuning parameter $\lambda$ is selected via a grid search over the grid
$\log_2(\lambda) = -14, -13, \ldots, 15$. When a tie occurs, we choose the larger value of $\lambda$.
As we suggest in Section 4, we use the $L_2$ MSVM solution to derive the weights
in the adaptive MSVMs. The $L_2$ MSVM solution is the final tuned solution using
the separate tuning set. Once the weights are chosen, we tune the parameter $\lambda$
in the adaptive procedure via the tuning set.
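
The grid and the tie-breaking rule described above can be written down in a few lines. This is a sketch only: it assumes the tuning-set error for each grid value has already been computed, and the function name `select_lambda` is hypothetical.

```python
import numpy as np

# Grid log2(lambda) = -14, -13, ..., 15, i.e. 30 candidate values.
log2_grid = np.arange(-14, 16)
lambdas = 2.0 ** log2_grid

def select_lambda(tuning_errors, lambdas):
    """Pick the lambda with the smallest tuning error; ties go to the larger lambda."""
    errors = np.asarray(tuning_errors, dtype=float)
    tied = np.flatnonzero(errors == errors.min())
    return lambdas[tied].max()

# Made-up tuning errors with a tie between grid positions 3 and 10.
errors = np.ones(lambdas.size)
errors[3] = 0.1
errors[10] = 0.1
chosen = select_lambda(errors, lambdas)
```

With the made-up errors above, the tie is resolved in favour of the larger value, $\lambda = 2^{-4}$, which corresponds to the sparser of the two tied models.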

We conduct 100 simulations for each classification method under all settings. 
Each fitted classifier is then evaluated in terms of its classification accuracy and 
variable selection performance. For each method, we report its average testing 
error, the number of correct and incorrect zero coefficients among Kd coeffi- 
cients, the model size as the number of important ones among the d variables, 
and the number of times that the true model is correctly identified. The num- 
bers given in the parentheses in the tables are the standard errors of the testing 
errors. We also summarize the frequency of each variable being selected over 100 
runs. All simulations are done using the optimization software CPLEX with the 
AMPL interface (Fourer et al., 2003). More information about CPLEX can be 
found on the ILOG website http://www.ilog.com/products/optimization/. 
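
The variable selection summaries reported in the tables can be computed from an estimated and a true coefficient matrix. The sketch below is one reading of those definitions: CZ and IZ are counted over the $Kd$ coefficients, MS counts variables with at least one nonzero coefficient, and the true model is taken to be identified when the selected variable set matches the true one. The function name `selection_summary` and the numerical tolerance `tol` are assumptions, not taken from the paper.

```python
import numpy as np

def selection_summary(W_hat, W_true, tol=1e-8):
    """Return (CZ, IZ, MS, CM) for K x d matrices W_hat and W_true."""
    zero_hat = np.abs(W_hat) <= tol          # coefficients estimated as zero
    zero_true = W_true == 0                  # coefficients that are truly zero
    cz = int((zero_hat & zero_true).sum())   # correct zeros
    iz = int((zero_hat & ~zero_true).sum())  # incorrect zeros
    selected = ~zero_hat.all(axis=0)         # variable kept if any coefficient is nonzero
    ms = int(selected.sum())                 # model size
    cm = bool(np.array_equal(selected, ~zero_true.all(axis=0)))
    return cz, iz, ms, cm

# Toy case with K = 2, d = 3: only the first variable is truly relevant.
W_true = np.array([[1.0, 0.0, 0.0],
                   [-1.0, 0.0, 0.0]])
W_hat = np.array([[0.5, 0.0, 0.1],
                  [-0.5, 0.0, -0.1]])
cz, iz, ms, cm = selection_summary(W_hat, W_true)
```

In this toy case the second variable is correctly removed (two correct zeros), the third is wrongly kept, so the model size is 2 and the true model is not identified.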

5.1. Five-Class Example

Consider a five-class example, with the input vector $x$ in a 10-dimensional space.
The first two components of the input vector are generated from a mixture
Gaussian in the following way: for each class $k$, generate $(x_1, x_2)$ independently
from $N(\mu_k, \sigma_1^2 I_2)$, with

$$\mu_k = 2\,(\cos([2k-1]\pi/5),\ \sin([2k-1]\pi/5)), \quad k = 1, 2, 3, 4, 5,$$

and the remaining eight components are i.i.d. generated from $N(0, \sigma_2^2)$. We
generate the same number of observations in each class. Here $\sigma_1 = \sqrt{2}$,
$\sigma_2 = 1$, $n = 250$, and $n' = 50{,}000$.
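
A data generator for this setting might look as follows. It is a sketch under the stated design (balanced classes, two informative coordinates, eight noise coordinates); the function name `simulate_five_class` and the use of NumPy's random generator are assumptions.

```python
import numpy as np

def simulate_five_class(n, rng, sigma1=np.sqrt(2.0), sigma2=1.0):
    """Draw n observations (n divisible by 5) from the five-class design."""
    K, d = 5, 10
    y = np.repeat(np.arange(1, K + 1), n // K)           # balanced labels 1..5
    angles = (2 * y - 1) * np.pi / 5
    mu = 2.0 * np.column_stack([np.cos(angles), np.sin(angles)])
    X = np.empty((y.size, d))
    X[:, :2] = mu + sigma1 * rng.normal(size=(y.size, 2))    # informative x1, x2
    X[:, 2:] = sigma2 * rng.normal(size=(y.size, d - 2))     # noise x3..x10
    return X, y

X_train, y_train = simulate_five_class(250, np.random.default_rng(0))
```

The class centres lie on a circle of radius 2, so only the first two coordinates carry information about the label.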

Table 1

Classification and variable selection results for the five-class example. TE, CZ, IZ, MS, and
CM refer to the testing error, the number of correct zeros, the number of incorrect zeros, the
model size, and the number of times that the true model is correctly identified, respectively.

| Method      | TE            | CZ    | IZ   | MS    | CM  |
|-------------|---------------|-------|------|-------|-----|
| L2          | 0.454 (0.034) | 0.00  | 0.00 | 10.00 |     |
| L1          | 0.558 (0.022) | 24.88 | 2.81 | 6.60  | 21  |
| Adapt-L1    | 0.553 (0.020) | 30.23 | 2.84 | 5.14  | 40  |
| Supnorm     | 0.453 (0.020) | 33.90 | 0.01 | 3.39  | 68  |
| Adapt-supI  | 0.455 (0.024) | 39.92 | 0.01 | 2.08  | 98  |
| Adapt-supII | 0.457 (0.046) | 39.40 | 0.09 | 2.17  | 97  |
| Bayes       | 0.387 (-)     | 41    |      | 2     | 100 |


Table 1 shows that, in terms of classification accuracy, the $L_2$ MSVM, the
supnorm MSVM, and the two adaptive supnorm MSVMs are among the best
and their testing errors are close to each other. In terms of other measurements
such as the number of correct/incorrect zeros, the model size, and the number of
times that the true model is correctly identified, the supnorm MSVM procedures
work much better than other MSVM methods.

Table 2 shows the frequency of each variable being selected by each procedure
in 100 runs. The type I adaptive sup-norm MSVM performs the best among all.
Overall, the adaptive MSVMs show significant improvement over the non-adaptive
classifiers in terms of both classification accuracy and variable selection.

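Table 2

Variable selection frequency results for the five-class example.

Selection Frequency

| Method      | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ | $x_{10}$ |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| L2          | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| L1          | 100 | 100 | 59  | 55  | 60  | 58  | 56  | 61  | 57  | 54  |
| Adapt-L1    | 100 | 100 | 11  | 10  | 13  | 37  | 39  | 11  | 35  | 35  |
| Supnorm     | 100 | 100 | 15  | 17  | 20  | 17  | 11  | 20  | 17  | 19  |
| Adapt-supI  | 100 | 100 | 1   | 1   |     | 2   | 1   | 1   | 1   | 1   |
| Adapt-supII | 100 | 100 | 2   | 2   | 2   | 2   | 2   | 2   | 3   | 2   |
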

5.2. Four-Class Linear Example

In the simulation example in Section 5.1, the informative variables are important
for all classes. In this section, we consider an example where the informative
variables are important for some classes but not important for other classes.
Specifically, we generate four i.i.d. important variables $x_1, x_2, x_3, x_4$ from
Unif$[-1, 1]$ as well as six independent i.i.d. noise variables $x_5, \ldots, x_{10}$ from
$N(0, 8^2)$. Define the functions

$$f_1 = -5x_1 + 5x_4,$$
$$f_2 = 5x_1 + 5x_2,$$
$$f_3 = -5x_2 + 5x_3,$$
$$f_4 = -5x_3 - 5x_4,$$

and set $p_k(x) = P(Y = k \mid X = x) \propto \exp(f_k(x))$, $k = 1, 2, 3, 4$. In this example,
we set $n = 200$ and $n' = 40{,}000$. Note that $x_1$ is not important for distinguishing
class 3 and class 4. Similarly, $x_2$ is noninformative for class 1 and class 4, $x_3$ is
noninformative for class 1 and class 2, and $x_4$ is noninformative for class 2 and
class 3.
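
The sampling scheme for this example can be sketched as below: the class probabilities are obtained by normalizing $\exp(f_k(x))$, and labels are drawn from them. The function name `simulate_four_class`, the inverse-CDF label draw, and the max-subtraction for numerical stability are implementation assumptions.

```python
import numpy as np

# Coefficients of f_1..f_4 on (x1, x2, x3, x4); each column sums to zero.
B = 5.0 * np.array([[-1, 0, 0, 1],
                    [1, 1, 0, 0],
                    [0, -1, 1, 0],
                    [0, 0, -1, -1]])

def simulate_four_class(n, rng):
    """Draw n observations from the four-class linear design."""
    X = np.empty((n, 10))
    X[:, :4] = rng.uniform(-1.0, 1.0, size=(n, 4))   # informative x1..x4
    X[:, 4:] = 8.0 * rng.normal(size=(n, 6))         # noise x5..x10 ~ N(0, 8^2)
    F = X[:, :4] @ B.T                               # F[i, k] = f_k(x_i)
    P = np.exp(F - F.max(axis=1, keepdims=True))     # stable softmax
    P /= P.sum(axis=1, keepdims=True)
    u = rng.random(n)
    idx = (u[:, None] > np.cumsum(P, axis=1)).sum(axis=1)
    y = np.minimum(idx, 3) + 1                       # labels in 1..4
    return X, y, P

X_sim, y_sim, P_sim = simulate_four_class(200, np.random.default_rng(1))
```

Each row of `B` has exactly two zero entries among the informative variables, which is the feature of this design that makes whole-variable removal less effective than coefficient-wise shrinkage.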

Table 3

Classification and variable selection results for the four-class linear example.

| Method      | TE            | CZ      | IZ     | MS    | CM  |
|-------------|---------------|---------|--------|-------|-----|
| L2          | 0.336 (0.063) | 0.0000  | 0.0000 | 10.00 |     |
| L1          | 0.340 (0.069) | 2.5100  | 0.1600 | 9.99  |     |
| Adapt-L1    | 0.320 (0.079) | 18.2300 | 0.2600 | 7.21  | 21  |
| Supnorm     | 0.332 (0.070) | 0.8500  | 0.1400 | 9.98  |     |
| Adapt-supI  | 0.327 (0.076) | 9.3300  | 0.1400 | 7.83  | 15  |
| Adapt-supII | 0.326 (0.071) | 9.9000  | 0.1400 | 7.69  | 9   |
| Bayes       | 0.1366 (-)    | 32      |        | 4     | 100 |



Table 3 summarizes the performance of various procedures, and Table 4 shows
the frequency of each variable being selected by each procedure in 100 runs. Due
to the increased difficulty of this problem, the performances of all methods are
not as good as those in the five-class example. From these results, we can see that
the adaptive procedures work better than the non-adaptive procedures in
terms of both classification accuracy and variable selection. Furthermore, the
adaptive $L_1$ MSVM performs the best overall. This is due to the difference
between the $L_1$ and the supnorm penalties. Our proposed supnorm penalty treats
all coefficients of one variable corresponding to different classes as a group and
removes the variable if it is non-informative across all class labels. By design of
this example, important variables have zero coefficients for certain classes. As a
result, our supnorm penalty does not deliver the best performance. Nevertheless,
the adaptive supnorm procedures still perform reasonably.



H.H. Zhang et al. /Variable selection for multicategory SVM 160 
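Table 4

Variable selection frequency results for the four-class example.

Selection Frequency

| Method      | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ | $x_9$ | $x_{10}$ |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| L2          | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| L1          | 100 | 100 | 100 | 100 | 100 | 90  | 100 | 100 | 100 | 100 |
| Adapt-L1    | 100 | 100 | 100 | 100 | 55  | 53  | 59  | 56  | 49  | 49  |
| Supnorm     | 100 | 100 | 100 | 100 | 100 | 99  | 100 | 100 | 100 | 99  |
| Adapt-supI  | 100 | 100 | 100 | 100 | 67  | 61  | 71  | 60  | 58  | 63  |
| Adapt-supII | 100 | 100 | 100 | 100 | 65  | 66  | 65  | 58  | 56  | 59  |
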


Fig 2. The Bayes boundary for the nonlinear three-class example.

5.3. Nonlinear Example

In this nonlinear 3-class example, we first generate $x_1 \sim$ Unif$[-3, 3]$ and
$x_2 \sim$ Unif$[-6, 6]$. Define the functions

$$f_1 = -2x_1 + 0.2x_1^2 - 0.1x_2^2 + 0.2,$$
$$f_2 = -0.4x_1^2 + 0.2x_2^2 - 0.4,$$
$$f_3 = 2x_1 + 0.2x_1^2 - 0.1x_2^2 + 0.2,$$

and set $p_k(x) = P(Y = k \mid X = x) \propto \exp(f_k(x))$, $k = 1, 2, 3$. The Bayes boundary
is plotted in Figure 2. We also generate three noise variables $x_i \sim N(0, \sigma^2)$,
$i = 3, 4, 5$. In this example, we set $\sigma = 2$ and $n' = 40{,}000$.

To achieve nonlinear classification, we fit the nonlinear MSVM by including
the five main effects, their square terms, and their cross products as the basis
functions. The results with $n = 200$ are summarized in Tables 5 and 6. Clearly,
Table 5

Classification and variable selection results using second order polynomial basis functions
for the nonlinear example in Section 5.3 with n = 200.

| Method      | TE            | CZ    | IZ   | MS    | CM  |
|-------------|---------------|-------|------|-------|-----|
| L2          | 0.167 (0.013) | 0.00  | 0.00 | 20.00 |     |
| L1          | 0.151 (0.012) | 21.42 | 0.03 | 14.91 |     |
| Adapt-L1    | 0.140 (0.010) | 43.13 | 0.00 | 6.92  | 31  |
| Supnorm     | 0.150 (0.012) | 22.70 | 0.01 | 14.43 |     |
| Adapt-supI  | 0.140 (0.010) | 40.84 | 0.00 | 7.21  | 31  |
| Adapt-supII | 0.140 (0.011) | 41.50 | 0.00 | 6.21  | 36  |
| Bayes       | 0.120 (-)     | 52    |      | 3     | 100 |



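Table 6

Variable selection frequency results for the nonlinear example using second order polynomial
basis functions with n = 200.

Selection Frequency

| Method      | $x_1$ | $x_1^2$ | $x_2^2$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_3^2$ | $x_4^2$ | $x_5^2$ |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| L2          | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| L1          | 100 | 100 | 100 | 69  | 44  | 50  | 43  | 80  | 84  | 89  |
| Adapt-L1    | 100 | 100 | 100 | 33  | 21  | 21  | 20  | 21  | 18  | 22  |
| Supnorm     | 100 | 100 | 100 | 67  | 37  | 42  | 34  | 81  | 80  | 75  |
| Adapt-supI  | 100 | 100 | 100 | 31  | 21  | 21  | 26  | 21  | 25  | 21  |
| Adapt-supII | 100 | 100 | 100 | 22  | 18  | 12  | 19  | 18  | 16  | 18  |

| Method      | $x_1x_2$ | $x_1x_3$ | $x_1x_4$ | $x_1x_5$ | $x_2x_3$ | $x_2x_4$ | $x_2x_5$ | $x_3x_4$ | $x_3x_5$ | $x_4x_5$ |
|-------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| L2          | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| L1          | 80  | 55  | 57  | 65  | 86  | 88  | 90  | 69  | 72  | 70  |
| Adapt-L1    | 31  | 20  | 18  | 20  | 28  | 26  | 31  | 20  | 17  | 22  |
| Supnorm     | 79  | 62  | 58  | 55  | 87  | 89  | 91  | 62  | 68  | 73  |
| Adapt-supI  | 31  | 22  | 17  | 28  | 30  | 29  | 30  | 21  | 16  | 25  |
| Adapt-supII | 25  | 15  | 14  | 19  | 30  | 23  | 22  | 16  | 17  | 17  |
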

the adaptive $L_1$ SVM and the two adaptive sup-norm SVMs deliver more
accurate and sparse classifiers than the other methods. In this example, there
are correlations among covariates and consequently the variable selection task
becomes more challenging. This difficulty is reflected in the variable selection
frequency reported in Table 6. Despite the difficulty, the adaptive procedures
are able to remove noise variables reasonably well.

To examine the performance of various methods using a richer set of basis
functions, we also fit nonlinear MSVMs via a polynomial basis of degree 3 with
55 basis functions. Results of classification and variable selection with $n = 200$
and 400 are reported in Tables 7 and 8 respectively. Compared with the case of
the second order polynomial basis, classification testing errors using the third
order polynomial basis are much larger for the $L_2$, $L_1$, and supnorm MSVMs,
but similar for the adaptive procedures. Due to the large basis set, none of the
methods can identify the correct model. However, the adaptive procedures can
eliminate more noise variables than the non-adaptive procedures. This further
demonstrates the effectiveness of adaptive weighting. The results of variable
selection frequency (not reported due to lack of space) show a similar pattern
as that of the second order polynomial. When $n$ increases from 200 to 400,
Table 7

Classification and variable selection results using third order polynomial basis functions for
the nonlinear example in Section 5.3 with n = 200.

| Method      | TE            | CZ     | IZ   | MS    | CM  |
|-------------|---------------|--------|------|-------|-----|
| L2          | 0.213 (0.018) | 0.00   | 0.00 | 55.00 |     |
| L1          | 0.170 (0.015) | 59.22  | 0.57 | 40.44 |     |
| Adapt-L1    | 0.138 (0.015) | 120.71 | 0.17 | 19.28 |     |
| Supnorm     | 0.171 (0.015) | 60.08  | 0.61 | 40.06 |     |
| Adapt-supI  | 0.141 (0.016) | 114.29 | 0.17 | 20.22 |     |
| Adapt-supII | 0.142 (0.015) | 106.78 | 0.22 | 19.75 |     |
| Bayes       | 0.120 (-)     | 157    |      | 3     | 100 |



Table 8

Classification and variable selection results using third order polynomial basis functions for
the nonlinear example in Section 5.3 with n = 400.

| Method      | TE            | CZ     | IZ   | MS    | CM  |
|-------------|---------------|--------|------|-------|-----|
| L2          | 0.162 (0.008) | 0.00   | 0.00 | 55.00 |     |
| L1          | 0.143 (0.008) | 60.13  | 0.34 | 40.50 |     |
| Adapt-L1    | 0.124 (0.004) | 139.71 | 0.00 | 11.01 |     |
| Supnorm     | 0.144 (0.010) | 60.51  | 0.32 | 40.24 |     |
| Adapt-supI  | 0.125 (0.005) | 139.41 | 0.00 | 10.37 |     |
| Adapt-supII | 0.125 (0.004) | 132.96 | 0.00 | 10.96 |     |
| Bayes       | 0.120 (-)     | 157    |      | 3     | 100 |



we can see that the classification accuracy of all methods increases, as expected.
Interestingly, compared to the case of $n = 200$, the variable selection
performance with $n = 400$ stays roughly the same for the non-adaptive procedures,
while it improves dramatically for the adaptive procedures.
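
The basis sizes used in this example (20 functions for the second order basis and 55 for the third order basis on five variables) can be reproduced by enumerating all monomials up to the given degree, excluding the constant. The sketch below is one way to build such a design matrix; the function name `polynomial_basis` and the naming of the columns are assumptions.

```python
from itertools import combinations_with_replacement

import numpy as np

def polynomial_basis(X, degree):
    """Expand X (n, d) into all monomials of total degree 1..degree.

    Returns the expanded matrix and a list of human-readable column names.
    """
    n, d = X.shape
    cols, names = [], []
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
            names.append("*".join(f"x{j + 1}" for j in idx))
    return np.column_stack(cols), names

X_demo = np.random.default_rng(0).normal(size=(8, 5))
H2, names2 = polynomial_basis(X_demo, degree=2)
H3, names3 = polynomial_basis(X_demo, degree=3)
```

With five input variables this gives 5 main effects, 15 second order terms and 35 third order terms, so the expanded matrices have 20 and 55 columns respectively.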

6. Real Example 

DNA microarray technology has made it possible to monitor mRNA expressions 
of thousands of genes simultaneously. In this section, we apply our six different 
MSVMs on the children cancer data set in Khan et al. (2001). Khan et al. (2001)
classified the small round blue cell tumors (SRBCTs) of childhood into 4 classes,
namely neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin
lymphoma (NHL), and the Ewing family of tumors (EWS), using cDNA gene
expression profiles. After filtering, 2308 gene profiles out of 6567 genes are given in the
data set, available at http://research.nhgri.nih.gov/microarray/Supplement/.
The data set includes a training set of size 63 and a test set of size 20. The dis- 
tributions of the four distinct tumor categories in the training and test sets are 
given in Table 9. Note that Burkitt lymphoma (BL) is a subset of NHL. 

To analyze the data, we first standardize the data sets by applying a simple 
linear transformation based on the training data. Specifically, we standardize 
the raw expression $\tilde{x}_{gi}$ of the g-th gene of subject i to obtain 
$x_{gi}$ by the following formula: 

$$x_{gi} = \frac{\tilde{x}_{gi} - \frac{1}{n} \sum_{j=1}^{n} \tilde{x}_{gj}}{\mathrm{sd}(\tilde{x}_{g1}, \ldots, \tilde{x}_{gn})}.$$
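
The standardization step can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' code: the function name `standardize`, the toy numbers, and the choice of the sample standard deviation (divisor n - 1) for sd(.) are assumptions. Reusing the training mean and standard deviation for the test subjects is our reading of "based on the training data".

```python
from statistics import mean, stdev


def standardize(train_rows, test_rows):
    """Standardize each gene with the training-set mean and standard deviation.

    train_rows[g] and test_rows[g] hold the raw expressions of gene g across
    the training subjects and the test subjects, respectively.
    """
    std_train, std_test = [], []
    for train_g, test_g in zip(train_rows, test_rows):
        center = mean(train_g)
        # Assumption: sd(.) is the sample standard deviation (n - 1 divisor).
        scale = stdev(train_g)
        std_train.append([(v - center) / scale for v in train_g])
        # Assumption: test subjects reuse the training statistics.
        std_test.append([(v - center) / scale for v in test_g])
    return std_train, std_test


# Toy data (hypothetical): two genes, four training subjects, two test subjects.
raw_train = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 14.0, 14.0]]
raw_test = [[2.5, 5.0], [12.0, 16.0]]
x_train, x_test = standardize(raw_train, raw_test)
```

After this step every gene has mean 0 and standard deviation 1 on the training set, while the test values are only approximately centered and scaled.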
Table 9 

Class distribution of the microarray example. 

Data set    NB    RMS    BL    EWS    Total
Training    12    20     8     23     63
Test        6     5      3     6      20



Table 10 

Classification results of the microarray data using 200 genes. 

                                Selected genes
Penalty      Testing Error    Top 100    Bottom 100
L2           0                100        100
L1           1/20             62         1
Adp-L1       0                53         1
Supnorm      1/20             53         0
Adp-supI     1/20             50         0
Adp-supII    1/20             47         0

Then we rank all genes using their marginal relevance in class separation by 
adopting a simple criterion used in Dudoit et al. (2002). Specifically, the 
relevance measure for gene g is defined to be the ratio of the between-class sum 
of squares to the within-class sum of squares as follows: 

$$R(g) = \frac{\sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i = k)\,(\bar{x}_{\cdot g}^{(k)} - \bar{x}_{\cdot g})^2}{\sum_{i=1}^{n} \sum_{k=1}^{K} I(y_i = k)\,(x_{ig} - \bar{x}_{\cdot g}^{(k)})^2},$$

where n is the size of the training set, $x_{ig}$ is the standardized expression 
level of gene g for subject i (the subject index is written first here), 
$\bar{x}_{\cdot g}^{(k)}$ denotes the average expression level of gene g for 
class k observations, and $\bar{x}_{\cdot g}$ is the overall mean expression 
level of gene g in the training set. To examine the variable selection 
performance of all the different methods, we select the top 100 and bottom 100 
genes as covariates according to the relevance measure R. Our main goal here is 
to get a set of "important" genes and also a set of "unimportant" genes, and to 
see whether our methods can effectively remove the "unimportant" genes. 
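
The ranking step can be illustrated with a short sketch. It is a plain-Python illustration under stated assumptions, not the authors' implementation: the names `relevance` and `top_and_bottom` and the toy data are hypothetical. The double sum over subjects and classes in the numerator is computed as the class size times the squared deviation of the class mean.

```python
def relevance(expr, labels):
    """Between-class over within-class sum of squares for one gene.

    expr[i] is the expression of the gene for subject i and labels[i] is the
    class of subject i.
    """
    overall = sum(expr) / len(expr)
    between = within = 0.0
    for k in set(labels):
        members = [x for x, y in zip(expr, labels) if y == k]
        class_mean = sum(members) / len(members)
        # sum_i I(y_i = k) (class mean - overall mean)^2 = n_k * (difference)^2
        between += len(members) * (class_mean - overall) ** 2
        within += sum((x - class_mean) ** 2 for x in members)
    # Not guarded: the ratio is undefined if the gene is constant within every class.
    return between / within


def top_and_bottom(genes, labels, m):
    """Indices of the m most and m least relevant genes, plus all scores."""
    scores = [relevance(expr, labels) for expr in genes]
    ranked = sorted(range(len(genes)), key=lambda g: scores[g], reverse=True)
    return ranked[:m], ranked[-m:], scores


# Toy data (hypothetical): three genes measured on four subjects from two classes.
labels = [0, 0, 1, 1]
genes = [
    [0.0, 0.2, 4.0, 4.2],    # separates the two classes well
    [1.0, -1.0, 1.1, -0.9],  # mostly noise
    [0.0, 1.0, 2.0, 3.0],    # moderate separation
]
top, bottom, scores = top_and_bottom(genes, labels, 1)
```

In the data analysis above, m = 100 and the 200 selected genes are passed on to the classifiers.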

All six MSVMs with different penalties are applied to the training set. We 
use leave-one-out cross validation on the standardized training data with 200 
genes for the purpose of tuning parameter selection and then apply the resulting 
classifiers to the testing data. The results are tabulated in Table 10. All methods 
have either 0 or 1 misclassification on the testing set. In terms of gene selection, 
the three sup-norm MSVMs are able to eliminate all bottom 100 genes, and they use 
around 50 genes out of the top 100 genes to achieve classification performance 
comparable to that of the other methods. 
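
The tuning step can be sketched generically. The code below is a hedged illustration of leave-one-out cross validation over a grid of tuning parameters. The classifier `fit_shrunken_centroid` is a hypothetical stand-in (a nearest-centroid rule with soft-thresholding) chosen only to keep the example self-contained. It is not the MSVM, and the grid, the toy data, and the tie-breaking rule (first grid value with the smallest error) are assumptions.

```python
def loocv_error(fit, X, y, lam):
    """Leave-one-out error rate of the classifier built by fit(X, y, lam)."""
    mistakes = 0
    for i in range(len(y)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        predict = fit(X_rest, y_rest, lam)
        mistakes += int(predict(X[i]) != y[i])
    return mistakes / len(y)


def select_tuning_parameter(fit, X, y, grid):
    """Grid value with the smallest leave-one-out error (first one on ties)."""
    errors = [loocv_error(fit, X, y, lam) for lam in grid]
    best = min(range(len(grid)), key=lambda j: errors[j])
    return grid[best], errors


def fit_shrunken_centroid(X, y, lam):
    """Hypothetical stand-in classifier, NOT the MSVM of the paper."""
    p = len(X[0])
    overall = [sum(row[j] for row in X) / len(X) for j in range(p)]
    centroids = {}
    for k in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == k]
        centroid = []
        for j in range(p):
            d = sum(row[j] for row in rows) / len(rows) - overall[j]
            # Soft-threshold the deviation of the class mean from the overall mean.
            shrunk = max(abs(d) - lam, 0.0) * (1.0 if d > 0 else -1.0)
            centroid.append(overall[j] + shrunk)
        centroids[k] = centroid

    def predict(x):
        return min(
            centroids,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(x, centroids[k])),
        )

    return predict


# Toy data (hypothetical): two well-separated classes in two dimensions.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 0.0], [11.0, 0.0], [10.0, 1.0]]
y = [0, 0, 0, 1, 1, 1]
best_lam, errors = select_tuning_parameter(fit_shrunken_centroid, X, y, [0.0, 100.0])
```

Any routine with the same `fit(X, y, lam)` interface, including an MSVM solver, could be passed to `select_tuning_parameter` in place of the stand-in.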

In Figure 3, we plot heat maps of both training and testing sets on the left 
and right panels respectively. In these heat maps, rows represent the 50 genes 
selected by the Type I sup-norm MSVM and columns represent patients. The gene 
expression values are reflected by colors on the plot, with red representing the 
highest expression level and blue the lowest expression level. For visualization, 
we group columns within each class together and use hierarchical clustering with 
correlation distance on the training set to order the genes so that genes close 
to each other have similar expressions. From the left panel of Figure 3, we can 
observe four block structures associated with the four classes. This implies that 
the 50 genes selected are highly informative in predicting the tumor types. For 
the testing set shown on the right panel, we can still see the four blocks, 
although the structure and pattern are not as clean as in the training set. It is 
interesting to note that several genes in the testing set have higher expression 
levels, i.e., more red, than in the training set. In summary, we conclude that 
the proposed sup-norm MSVMs are indeed effective in performing simultaneous 
classification and variable selection. 

Fig 3. Heat maps of the microarray data. The left panel (Training data) and the 
right panel (Test data) represent the training and testing sets respectively. 
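
The ordering used for the heat maps can be sketched as follows. This is a naive plain-Python illustration, not the authors' plotting code: the paper does not state the linkage rule, so average linkage is an assumption, and the function names and toy profiles are hypothetical.

```python
from math import sqrt


def correlation_distance(a, b):
    """One minus the Pearson correlation of two expression profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / sqrt(va * vb)


def cluster_order(profiles):
    """Gene order from naive agglomerative clustering (average linkage assumed)."""
    n = len(profiles)
    dist = [
        [correlation_distance(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]
    clusters = [[i] for i in range(n)]

    def average_link(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        a, b = min(pairs, key=lambda ab: average_link(clusters[ab[0]], clusters[ab[1]]))
        # Concatenating keeps the members of a merged cluster adjacent in the order.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters[0]


def group_columns(labels):
    """Column order that places subjects of the same class next to each other."""
    return sorted(range(len(labels)), key=lambda i: labels[i])


# Toy profiles (hypothetical): genes 0 and 2 move together, as do genes 1 and 3.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],
    [2.1, 3.9, 6.0, 8.1, 9.9, 12.0],
    [6.1, 1.0, 4.9, 2.0, 4.1, 3.0],
]
gene_order = cluster_order(profiles)
column_order = group_columns([1, 0, 1, 0])
```

The row order is computed on the training set only and then reused for the test panel, so the two heat maps are directly comparable.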

7. Discussion 

As pointed out in Lafferty and Wasserman (2006), sparse learning is an important 
but challenging issue for high dimensional data. In this paper, we propose 
a new regularization method which applies the sup-norm penalty to the MSVM 
to achieve variable selection. Through the new penalty, the natural group effect 
of the coefficients associated with the same variable is embedded in the 
regularization framework. As a result, the sup-norm MSVMs can perform better 
variable selection and deliver more parsimonious classifiers than the L1 MSVMs. 
Moreover, our results show that the adaptive procedures work very well and 
improve upon the corresponding nonadaptive procedures. The adaptive L1 procedure 
can in some settings be as good as, and sometimes better than, the adaptive 
sup-norm procedures. As a future research direction, we will further investigate 
the theoretical properties of the proposed methods. 

In some problems, it is possible to form groups among covariates. As argued 
in Yuan and Lin (2006) and Zou and Yuan (2006), it is advisable to use such 
group information in the model building process to improve accuracy of the 
prediction. The notion of "group lasso" has also been studied in the context of 
learning a kernel (Micchelli and Pontil, 2007). If this kind of group information 
is available for multicategory classification, there will be two kinds of group 
information available for model building: one kind of group formed by the same 
covariate corresponding to different classes, as considered in this paper, and the 
other kind formed among covariates. A future research direction is to combine 
both kinds of group information to construct a new multicategory classification 
method. We believe that such potential classifiers can outperform those that do 
not use the additional information. 

This paper focuses on the variable selection issue for supervised learning. 
In practice, semi-supervised learning is often encountered, and many methods 
have been developed, including those of Zhu et al. (2003) and Wang and Shen (2007a). 
Another future topic is to generalize the sup-norm penalty to the context of 
semi-supervised learning. 

Appendix 

Proof of Proposition 2.1: Without loss of generality, assume that 
$\{w_{1j}, w_{2j}, w_{3j}\}$ are all nonzero. Because of the sum-to-zero 
constraint $w_{1j} + w_{2j} + w_{3j} = 0$, one component of 
$\{w_{1j}, w_{2j}, w_{3j}\}$ must have a different sign from the other two. 
Suppose the sign of $w_{1j}$ differs from the other two; then 
$|w_{1j}| = |w_{2j}| + |w_{3j}|$ by the sum-to-zero constraint. Consequently, we 
have $|w_{1j}| = \max\{|w_{1j}|, |w_{2j}|, |w_{3j}|\}$. Therefore, 
$\sum_{k=1}^{3} |w_{kj}| = 2 \|w_{(j)}\|_\infty$, where 
$w_{(j)} = (w_{1j}, w_{2j}, w_{3j})^T$. The equivalence of problem (2.2) with the 
tuning parameter $\lambda$ and problem (2.6) with the tuning parameter 
$2\lambda$ can then be established. This completes the proof. 
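
The identity at the heart of the proof can be checked numerically. The sketch below is only an illustration: the random vectors, the seed, and the helper names `l1_norm` and `sup_norm` are assumptions. It draws vectors of length three that satisfy the sum-to-zero constraint and compares the sum of absolute values with twice the sup-norm.

```python
import random


def l1_norm(w):
    """Sum of the absolute values of the components."""
    return sum(abs(v) for v in w)


def sup_norm(w):
    """Largest absolute value among the components."""
    return max(abs(v) for v in w)


random.seed(0)
max_gap = 0.0
for _ in range(1000):
    w1, w2 = random.uniform(-5, 5), random.uniform(-5, 5)
    w = [w1, w2, -(w1 + w2)]  # enforces the sum-to-zero constraint with three classes
    max_gap = max(max_gap, abs(l1_norm(w) - 2 * sup_norm(w)))

# A hand-checkable instance: |3| + |-1| + |-2| = 6 = 2 * max(3, 1, 2).
example = [3.0, -1.0, -2.0]
```

Up to floating-point rounding, `max_gap` is zero, which matches the factor of two between the tuning parameters of the two problems.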